Introduction

The Global Health Expenditure Database (GHED) provides time series data from 2000-2022 which covers country-level spending in the world. Healthcare in the USA is notably more expensive than its counter-parts across the world. This dataset will allow us to examine how different countries allocate their funds towards different segments of healthcare. For example, we may be able to explore how much spending is being allocated towards injuries, for example.

To explore the given dataset, we will look at three subsets in attempts to uncover interesting potential leads. The motivation behind this stems from the large presence of missing values that are in the dataset. We will demonstrate this below. Furthermore, variable names are not intuitive. Creating more meaningful subsets of the larger dataset will allow us to explore potentially interesting questions.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
GHED_data <- read_excel("GHED_data.XLSX", sheet = 1)

# View the structure of the dataset
str(GHED_data)
## tibble [4,244 × 3,923] (S3: tbl_df/tbl/data.frame)
##  $ country                 : chr [1:4244] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ code                    : chr [1:4244] "DZA" "DZA" "DZA" "DZA" ...
##  $ region                  : chr [1:4244] "AFR" "AFR" "AFR" "AFR" ...
##  $ income                  : chr [1:4244] "Lower-middle" "Lower-middle" "Lower-middle" "Lower-middle" ...
##  $ year                    : num [1:4244] 2000 2001 2002 2003 2004 ...
##  $ che_gdp                 : num [1:4244] 3.49 3.84 3.73 3.6 3.54 ...
##  $ che_pc_usd              : num [1:4244] 62.1 67.3 66.9 76.2 93 ...
##  $ hk_gdp                  : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hk_g_gdp                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hk_ext_gdp              : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ che                     : num [1:4244] 143870 162231 168702 189137 217929 ...
##  $ gghed                   : num [1:4244] 103534 123664 126997 145057 155500 ...
##  $ pvtd                    : num [1:4244] 40261 38492 41630 43985 62327 ...
##  $ ext                     : num [1:4244] 75.1 75.1 75.1 95 102 ...
##  $ dom_che                 : num [1:4244] 99.9 100 100 99.9 100 ...
##  $ gghed_che               : num [1:4244] 72 76.2 75.3 76.7 71.4 ...
##  $ pvtd_che                : num [1:4244] 28 23.7 24.7 23.3 28.6 ...
##  $ oops_che                : num [1:4244] 25.8 21.7 22.5 21.1 26.1 ...
##  $ vpp_che                 : num [1:4244] 0.818 0.848 0.954 0.981 1.445 ...
##  $ ext_che                 : num [1:4244] 0.0522 0.0463 0.0445 0.0502 0.0468 ...
##  $ gghed_gdp               : num [1:4244] 2.51 2.93 2.81 2.76 2.53 ...
##  $ gghed_gge               : num [1:4244] 8.78 9.27 7.97 9.44 8.67 ...
##  $ gghed_pc_usd            : num [1:4244] 44.7 51.3 50.4 58.5 66.4 ...
##  $ pvtd_pc_usd             : num [1:4244] 17.4 16 16.5 17.7 26.6 ...
##  $ oop_pc_usd              : num [1:4244] 16 14.6 15.1 16.1 24.3 ...
##  $ ext_pc_usd              : num [1:4244] 0.0324 0.0312 0.0298 0.0383 0.0435 ...
##  $ tran_shi                : num [1:4244] 0 0 0 0 0 0 0 0 0 0 ...
##  $ shise_shi               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ cfa_che                 : num [1:4244] 72 76.2 75.3 76.7 71.4 ...
##  $ gfa_che                 : num [1:4244] 45.9 50.5 49.3 50.7 47.1 ...
##  $ chi_che                 : num [1:4244] 26.1 25.8 25.9 26 24.2 ...
##  $ shi_che                 : num [1:4244] 26.1 25.8 25.9 26 24.2 ...
##  $ chi_pvt_che             : num [1:4244] 0 0 0 0 0 0 0 0 0 0 ...
##  $ vfa_che                 : num [1:4244] 28 23.8 24.7 23.3 28.6 ...
##  $ vhi_che                 : num [1:4244] 0.818 0.848 0.954 0.981 1.445 ...
##  $ row_che                 : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ phc_usd_pc              : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ phc_che                 : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ phc_gghed_usd_pc        : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ gghed_phc_gghed         : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ gghed_phc_phc           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ phc_ext_usd_pc          : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ ext_phc_ext             : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ ext_phc_phc             : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ phc_public_gdp          : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ phc_pvtd_usd_pc         : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ pvtd_phc_pvtd           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ pvtd_phc_phc            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hc62_che                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hc62_g_gghed            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hc62_ext_ext            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ fp3214_che              : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ fp3214_gghed_fp3214     : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ fp3214_ext_fp3214       : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis1_che                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis11_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis12_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis13_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis16_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis2_che                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis21_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis22_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis23_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis3_che                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis4_che                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis5_che                : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ disnec_che              : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis1_g_gghed            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis11_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis12_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis13_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis16_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis2_g_gghed            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis21_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis22_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis23_g_gghed           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis3_g_gghed            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis4_g_gghed            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis5_g_gghed            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ disnec_g_gghed          : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis1_ext_ext            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis11_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis12_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis13_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis16_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis2_ext_ext            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis21_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis22_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis23_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis3_ext_ext            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis4_ext_ext            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ dis5_ext_ext            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ disnec_ext_ext          : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hccov_che               : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hccov_usd_pc            : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hccov_gghed_gghed       : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ hccov_ext_ext           : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ gge_gdp                 : num [1:4244] 28.6 31.6 35.2 29.3 29.2 ...
##  $ gdp_pc_usd              : num [1:4244] 1780 1755 1795 2117 2625 ...
##   [list output truncated]

We can notice that the size of the datasset is 4,224 x 3,923, representing a large number of both observations and variables. The dataset does not provide much value as it is. Additionally, we can begin to see the missing value problem that was previously mentioned.

GHED_data_filtered <- GHED_data %>%
  filter(year == 2021)

len <- length(GHED_data)

missing_values <- as.data.frame(colSums(is.na(GHED_data_filtered)))
colnames(missing_values) <- "n"
missing_values$n <- missing_values$n / len

ggplot(missing_values, aes(x = n)) +
  geom_histogram() +
  labs(title = "Proportion of Missing Values in 2021", x = "proportion")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In just one year, the proportion of rows that have missing values vary greatly, ranging from 0% - 5%. From here on, we will begin to explore subsets of the data. The following three ideas will be investigated.

  1. Vaccines are known to reduce the incidence rate of infectious diseases. Thus, we want to investigate if higher spending on vaccines reduces the amount countries need to spend on infectious disease treatment. We will use the following variables:
  1. Preventative services are often important in reducing long-term health costs and improving population health. Therefore it is crucial to examine how do healthcare financing, specifically out-of-pocket (OOPS) payments and government healthcare spending (GGHE-D), influence access to healthcare services, including preventative care, across different income groups and regions. Understanding the financial barriers across different income groups and regions can provide valuable insights for policymakers seeking to improve health equity and design more effective healthcare systems.
  1. Healthcare in America is marked by large out of pocket (OOP) costs. At the same time, it seems that we pay more in premiums than ever. Thus, it may be interesting to investigate how healthcare expenses change as the amount of government intervention increases. For example, do countries that have more government contribution allocate their funds differently? We will use the following variables to lay the groundwork to answer this question:

Section 1: Disease Specific Spending

To begin this section, we will subset a section of data to use, as well as renaming certain variables to make it more readable and understandable to the viewers. This section is to look into how each countries have spent on specific disease as a percentage of their total health expenditure.

In this section we will take a look at: - Infectious and Parasitic Diseases - HIV/Aids and Sexually transmitted diseases - Tuberculosis - Malaria - Neglected Tropical Diseases - Reproductive Health - Maternal Conditions - Perinatal Conditions - Contraceptive Management - Nutritional Deficiencies - Injuries - Other unspecified diseases/conditions

For this subset we will choose the year of 2019 as the baseline as it represents the most recent snapshot of pre-COVID medical expenses

disease_dataset <- GHED_data %>%
  rename(
    infectious = dis1_che,
    hiv_aids = dis11_che,
    tuberculosis = dis12_che,
    malaria = dis13_che,
    neg_tropical = dis16_che,
    reproductive = dis2_che,
    maternal = dis21_che,
    perinatal = dis22_che,
    contraceptive = dis23_che,
    nutrition = dis3_che,
    noncomm = dis4_che,
    injuries = dis5_che,
    other = disnec_che
  ) %>%
  select(
    country,
    region,
    year,
    hiv_aids,
    tuberculosis,
    malaria,
    neg_tropical,
    reproductive,
    maternal,
    perinatal,
    contraceptive,
    nutrition,
    injuries,
    other
  )
  head(disease_dataset)
## # A tibble: 6 × 14
##   country region  year hiv_aids tuberculosis malaria neg_tropical reproductive
##   <chr>   <chr>  <dbl>    <dbl>        <dbl>   <dbl>        <dbl>        <dbl>
## 1 Algeria AFR     2000       NA           NA      NA           NA           NA
## 2 Algeria AFR     2001       NA           NA      NA           NA           NA
## 3 Algeria AFR     2002       NA           NA      NA           NA           NA
## 4 Algeria AFR     2003       NA           NA      NA           NA           NA
## 5 Algeria AFR     2004       NA           NA      NA           NA           NA
## 6 Algeria AFR     2005       NA           NA      NA           NA           NA
## # ℹ 6 more variables: maternal <dbl>, perinatal <dbl>, contraceptive <dbl>,
## #   nutrition <dbl>, injuries <dbl>, other <dbl>

Looking through the current subset, it seems that there are areas that are missing information or may contain rows that only have limited information. In order to be able to provide a detailed analysis for this section we will need to clean the dataset. We will first take a look at the overall structure of the dataset.

str(disease_dataset)
## tibble [4,244 × 14] (S3: tbl_df/tbl/data.frame)
##  $ country      : chr [1:4244] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ region       : chr [1:4244] "AFR" "AFR" "AFR" "AFR" ...
##  $ year         : num [1:4244] 2000 2001 2002 2003 2004 ...
##  $ hiv_aids     : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ tuberculosis : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ malaria      : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ neg_tropical : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ reproductive : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ maternal     : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ perinatal    : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ contraceptive: num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ nutrition    : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ injuries     : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ other        : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...

Looking through this, we will need to filter out years that we do not need for this section, so to filter through to only 2019 data.

disease_filtered <- subset(disease_dataset, year %in% c(2019))
str(disease_filtered)
## tibble [192 × 14] (S3: tbl_df/tbl/data.frame)
##  $ country      : chr [1:192] "Algeria" "Angola" "Benin" "Botswana" ...
##  $ region       : chr [1:192] "AFR" "AFR" "AFR" "AFR" ...
##  $ year         : num [1:192] 2019 2019 2019 2019 2019 ...
##  $ hiv_aids     : num [1:192] NA NA NA 20.6 3.77 ...
##  $ tuberculosis : num [1:192] NA NA NA 3.386 0.235 ...
##  $ malaria      : num [1:192] NA NA NA 5.79 15.22 ...
##  $ neg_tropical : num [1:192] NA NA NA 1.24 4.88 ...
##  $ reproductive : num [1:192] NA NA NA 10 14.5 ...
##  $ maternal     : num [1:192] NA NA NA 2.96 9.24 ...
##  $ perinatal    : num [1:192] NA NA NA NA 3.99 ...
##  $ contraceptive: num [1:192] NA NA NA NA 1.28 ...
##  $ nutrition    : num [1:192] NA NA NA 1.739 0.725 ...
##  $ injuries     : num [1:192] NA NA NA 3.66 1.76 ...
##  $ other        : num [1:192] NA NA NA 0.0755 33.9561 ...

However, in this subset, there contains rows where there are only few amounts of data, and a few that are missing only a certain amount of data. For this, we will create a quick function to look through the specific columns listec below to check for the ones that we want to keep. We will keep the rows that contain less missing values than actual values that are in the row.

columns_to_check <- c("hiv_aids", "tuberculosis", "malaria", "neg_tropical", "reproductive", "maternal", "perinatal", "contraceptive", "nutrition", "injuries", "other")
clean_disease <- disease_filtered[rowSums(is.na(disease_filtered[columns_to_check])) < length(columns_to_check), ]

Now we want to take a look at the region specific data. We want to know which regions are most prominent in this subset now. From what we can see of this dataset, the most prominent region is the Africa region.

This dataset was collected by the World Health Organization, it has more data available for the Africa region, since they have numerous health initiatives as Africa faces a significant burden of communicable diseases, leading to more focus and more data collection to understand the underlying issues faced by those in the region.

However, the amount of data collected by other regions may be too low, in comparison, which could lead to other questions such as there is a signifcant lack of data for these areas specifically? Could it be a logistics issue or perhaps the government not willing to share the specific datasets.

region_counts_df <- clean_disease %>%
  group_by(region) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  filter(!is.na(region))
print(region_counts_df)
## # A tibble: 6 × 2
##   region Count
##   <chr>  <int>
## 1 AFR       36
## 2 AMR        2
## 3 EMR        2
## 4 EUR        6
## 5 SEAR       1
## 6 WPR        3

Plotting Region Specific Stacked Bar Charts

# Stacked Bar Chart of AFR Region
ggplot(clean_disease %>% filter(region == "AFR") %>%
         pivot_longer(cols = c(hiv_aids, tuberculosis, malaria,
                                neg_tropical, reproductive, maternal,
                                perinatal, contraceptive, nutrition,
                                injuries, other),
                      names_to = "Metric",
                      values_to = "Percentage"),
       aes(x = country, y = Percentage, fill = Metric)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Health Metrics by Country (AFR Region)",
       x = "Country",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +  # Place legend at the bottom
  scale_y_continuous(labels = scales::percent_format(scale = 1))
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).

In the Africa region it seems that the highest spending is on malaria, this can be explained by the mosquitos of that region having a longer life-span in comparison to those of other regions. Even though it is a disease that can be treated with care, in a region where it is hard to do so, prevents Malaria from being easily treated.

# Stacked Bar Chart of AMR Region
ggplot(clean_disease %>% filter(region == "AMR") %>%
         pivot_longer(cols = c(hiv_aids, tuberculosis, malaria,
                                neg_tropical, reproductive, maternal,
                                perinatal, contraceptive, nutrition,
                                injuries, other),
                      names_to = "Metric",
                      values_to = "Percentage"),
       aes(x = country, y = Percentage, fill = Metric)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Health Metrics by Country (AMR Region)",
       x = "Country",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

In the Americas, there is a huge lack of data in this area, which can be explained by the WHO keeping the United States data seperate from the data making it hard to analyze both at the same time. However the lack of data in the South of the Americas is suprising which could stem from logistics issues or other limiting factors.

# Stacked Bar Chart of EMR Region
ggplot(clean_disease %>% filter(region == "EMR") %>%
         pivot_longer(cols = c(hiv_aids, tuberculosis, malaria,
                                neg_tropical, reproductive, maternal,
                                perinatal, contraceptive, nutrition,
                                injuries, other),
                      names_to = "Metric",
                      values_to = "Percentage"),
       aes(x = country, y = Percentage, fill = Metric)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Health Metrics by Country (EMR Region)",
       x = "Country",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

For the EMR region it faces the same issue as the Americas, where there is a suprising lack of information. This could be stemmed to logistics issue where there are conflicts in countries inside the region such as Syria, Yemen, and Afghanistan. This could lead to medical issues in that region as well as not much access to treatment, making it hard for the people in those regions to report the data.

# Stacked Bar Chart of EUR Region
ggplot(clean_disease %>% filter(region == "EUR") %>%
         pivot_longer(cols = c(hiv_aids, tuberculosis, malaria,
                                neg_tropical, reproductive, maternal,
                                perinatal, contraceptive, nutrition,
                                injuries, other),
                      names_to = "Metric",
                      values_to = "Percentage"),
       aes(x = country, y = Percentage, fill = Metric)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Health Metrics by Country (EUR Region)",
       x = "Country",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).

For the EUR region there is more readibly avaiable information elsewhere, perhaps from the Eurpoean Union, which may lead to less reporting to the WHO. As well as having a good healthcare system, leading to few reports about some of the issues that other regions may face.

# Stacked Bar Chart of SEAR Region
ggplot(clean_disease %>% filter(region == "SEAR") %>%
         pivot_longer(cols = c(hiv_aids, tuberculosis, malaria,
                                neg_tropical, reproductive, maternal,
                                perinatal, contraceptive, nutrition,
                                injuries, other),
                      names_to = "Metric",
                      values_to = "Percentage"),
       aes(x = country, y = Percentage, fill = Metric)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Health Metrics by Country (SEAR Region)",
       x = "Country",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

In the South East Asian Region, this could stem from health system imitations: Many countries in SEAR have under-resourced health systems with limited capacity for comprehensive data collection. This can lead to inconsistent or incomplete reporting of health metrics, especially in rural or underserved areas. As well as, a high disease burden: The region experiences a high burden of infectious diseases (e.g., tuberculosis, dengue, malaria) and non-communicable diseases (e.g., diabetes, cardiovascular diseases). The focus on addressing immediate health challenges can deprioritize routine data collection efforts, affecting data quality.

# Stacked Bar Chart of WPR Region
ggplot(clean_disease %>% filter(region == "WPR") %>%
         pivot_longer(cols = c(hiv_aids, tuberculosis, malaria,
                                neg_tropical, reproductive, maternal,
                                perinatal, contraceptive, nutrition,
                                injuries, other),
                      names_to = "Metric",
                      values_to = "Percentage"),
       aes(x = country, y = Percentage, fill = Metric)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Health Metrics by Country (WPR Region)",
       x = "Country",
       y = "Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

The WPR region could face issues from a digital divide, some people may have limited access to digital health technologies, resulting in a digital divide that can impact data collection quality and timeliness.

Section 2: Out-of-pocket (OOPS) and government healthcare spending (GGHE-D) influence ob healthcare outcomes across income groups and regions

Selecting the revelant columns from the data set for the analysis and removing missing values in these columns

data <- GHED_data

subset_data <- data %>%
  select(region, income, year, gghed_che, oops_che, che_gdp, che_pc_usd) %>%
  filter(!is.na(gghed_che) & !is.na(oops_che) & !is.na(che_gdp) & !is.na(che_pc_usd))

We group the data by region and income to calculate the mean values of government health expenditure (gghed_che), out-of-pocket expenditure (oops_che), and overall health expenditure (che_gdp and che_pc_usd). This helps us summarize the key variables by region and income level.

summary_stats <- subset_data %>%
  group_by(region, income) %>%
  summarise(
    mean_gghed_che = mean(gghed_che, na.rm = TRUE),
    mean_oops_che = mean(oops_che, na.rm = TRUE),
    mean_che_gdp = mean(che_gdp, na.rm = TRUE),
    mean_che_pc_usd = mean(che_pc_usd, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
# Print the summary stats
print(summary_stats)
## # A tibble: 19 × 6
## # Groups:   region [6]
##    region income       mean_gghed_che mean_oops_che mean_che_gdp mean_che_pc_usd
##    <chr>  <chr>                 <dbl>         <dbl>        <dbl>           <dbl>
##  1 AFR    High                   74.5          23.3         4.68           552. 
##  2 AFR    Low                    22.7          41.1         5.56            30.1
##  3 AFR    Lower-middle           36.1          41.3         4.73            83.5
##  4 AFR    Upper-middle           46.1          29.7         5.44           332. 
##  5 AMR    High                   53.2          30.8         7.70          2002. 
##  6 AMR    Lower-middle           43.7          39.2         6.68           143. 
##  7 AMR    Upper-middle           54.0          35.4         5.98           388. 
##  8 EMR    High                   74.1          17.2         3.48          1037. 
##  9 EMR    Low                    25.5          65.9         6.53            56.1
## 10 EMR    Lower-middle           39.9          48.1         5.06           197. 
## 11 EMR    Upper-middle           55.6          37.5         5.36           223. 
## 12 EUR    High                   72.7          20.0         8.01          3160. 
## 13 EUR    Lower-middle           40.6          54.3         6.08            90.3
## 14 EUR    Upper-middle           50.9          45.0         6.52           332. 
## 15 SEAR   Lower-middle           35.4          49.2         3.77            56.9
## 16 SEAR   Upper-middle           61.6          26.9         6.15           393. 
## 17 WPR    High                   72.3          15.7         7.91          1986. 
## 18 WPR    Lower-middle           51.2          23.7         5.68           115. 
## 19 WPR    Upper-middle           56.9          17.8         8.20           459.

We calculate correlations between:

This step helps to see how government spending or individual costs relate to overall health expenditure.

There is a negative correlation of -0.738 between government expenditure and out-of-pocket expenditure, which suggests that in countries where the government spends more on healthcare, individuals are required to pay less out-of-pocket for medical services and treatments. Essentially, greater government funding reduces the financial burden on citizens.

There is a positive correlation of 0.229 between government expenditure and health expenditure as % of GDP, which suggests the more the government contributes to healthcare, the more the country as a whole tends to spend on health relative to the size of its economy.

There is a negative correlation of -0.351 between out of pocket expenditure and health expenditure as % of GDP suggests that countries or regions with higher health expenditure as a percentage of GDP are likely to have lower out-of-pocket spending by individuals.

correlation_analysis <- subset_data %>%
  summarise(
    correlation_gghed_oops = cor(gghed_che, oops_che, use = "complete.obs"),
    correlation_gghed_che_gdp = cor(gghed_che, che_gdp, use = "complete.obs"),
    correlation_oops_che_gdp = cor(oops_che, che_gdp, use = "complete.obs")
  )

# Print the correlation analysis
print(correlation_analysis)
## # A tibble: 1 × 3
##   correlation_gghed_oops correlation_gghed_che_gdp correlation_oops_che_gdp
##                    <dbl>                     <dbl>                    <dbl>
## 1                 -0.739                     0.229                   -0.351

The scatter plot explores the relationship between government expenditure and out-of-pocket costs across income groups. A linear trend line is added to assess the general trend. Here it shows that government expenditure on health is higher in high income region than low income region, and vice versa.

ggplot(subset_data, aes(x = gghed_che, y = oops_che, color = income)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Relationship between Government Expenditure and Out-of-Pocket Expenditure",
    x = "Government Expenditure on Health (% of Current Health Expenditure)",
    y = "Out-of-Pocket Expenditure (% of Current Health Expenditure)",
    color = "Income Level"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The bar chart shows average health expenditure per capita by income level and region, providing a clear comparison of spending levels. The drastic difference in average health expenditure per capita between high income level region and all other three regions shows a concerning reality of how wealth plays into an effect on health resource allocation.

ggplot(summary_stats, aes(x = income, y = mean_che_pc_usd, fill = region)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Average Health Expenditure per Capita by Income Level and Region",
    x = "Income Level",
    y = "Average Health Expenditure per Capita (USD)",
    fill = "Region"
  ) +
  theme_minimal()

The line chart tracks government health expenditure as a % of current health expenditure over time by income level, allowing us to see trends across different income groups from 2000-2020.

Unsurprinsgly, the high income regions are consistenly high on their expenditure on health at around 80% of CHE. Upper middle regions are experiencing a decrease in expenditure since 2020 and has since fallen to the same expenditure level as the lower middle regions in 2022. The low income regions are struggling to stay above their expenditure level at around 20% of CHE.

These dynamics highlight the need for targeted policies to enhance healthcare financing and infrastructure, particularly in lower-income regions, to ensure equitable access to health services and improve overall health outcomes.

ggplot(subset_data, aes(x = year, y = gghed_che, color = income)) +
  geom_line(stat = "summary", fun = "mean") +
  labs(
    title = "Trend of Government Health Expenditure (% of CHE) by Income Group (2000-2020)",
    x = "Year",
    y = "Government Expenditure on Health (% of Current Health Expenditure)",
    color = "Income Group"
  ) +
  theme_minimal()

The box plot identify detect outliers in out-of-pocket and government health expenditures across income levels and regions. This can help identify patterns and variances that warrant further investigation.

Out of pocket expenditure distributed by income level and region box plot shows that there are a few outliers in EUR and WPR region for the high income level and EMR region for upper middle income level.

ggplot(subset_data, aes(x = income, y = oops_che, fill = region)) +
  geom_boxplot() +
  labs(
    title = "Out-of-Pocket Expenditure Distribution by Income Level and Region",
    x = "Income Level",
    y = "Out-of-Pocket Expenditure (% of Current Health Expenditure)"
  ) +
  theme_minimal()

Government health expenditure distribution by income level and region box plot shows there are outliers across all four income level. There are outliers in EUR region for high income level; outliers in AFR region for low income level; outliers in AMR region lower middle income level; outliers in AFR region and WPR region for upper middle income level.

ggplot(subset_data, aes(x = income, y = gghed_che, fill = region)) +
  geom_boxplot() +
  labs(
    title = "Government Health Expenditure Distribution by Income Level and Region",
    x = "Income Level",
    y = "Government Health Expenditure (% of Current Health Expenditure)"
  ) +
  theme_minimal()

The outliers shows that that they are potentially skewing results or indicating areas requiring targeted policy interventions. For example the outliers seen in high income level from both the boxplots suggests that there is still a need for regulations to investigate high income regions despite already having the the highest government health expenditure as well as the lowest out of pocket expenditure.

Section 3: Out of Pocket Costs & Insurance Contributions

To begin analyzing the data, we will select a subset of variables and rename them in order to make downstream analysis easier. Displayed below is a subset of the new dataset which we will use to explore our question. The goal is to look at the proportion of OOP healthcare costs vs. social insurance contributions across different countries.

The scope of this analysis will include all countries with valid data. We will begin with looking at all years, and eventually narrow down to 2019. We choose 2019 as a baseline as it represents the most recent snapshot of pre-COVID medical expenses.

# Rename variables to have intuitive names
OOP_subset <- GHED_data %>%
  rename(
    current_health_expenditure_pct_gdp = che_gdp,
    gdp_per_capita_usd = gdp_pc_usd,
    primary_health_care_expenditure_pct_che = phc_che,
    total_expenditure = hf,
    govt_and_compulsory_financing = hf1,
    government_schemes = hf11,
    compulsory_contributory_health_insurance = hf12,
    social_health_insurance_schemes = hf121,
    compulsory_private_insurance_schemes = hf122,
    unspecified_compulsory_contributory_insurance = hf12nec,
    compulsory_medical_savings_accounts = hf13,
    unspecified_govt_compulsory_schemes = hf1nec,
    voluntary_health_payment_schemes = hf2,
    voluntary_health_insurance_schemes = hf21,
    non_profit_institutions_serving_households = hf22,
    enterprise_financing_schemes = hf23,
    unspecified_voluntary_payment_schemes = hf2nec,
    household_out_of_pocket_payments = hf3,
    rest_of_world_financing_schemes = hf4
  ) %>%
  select(
    country,
    region,
    year,
    income,
    total_expenditure,
    current_health_expenditure_pct_gdp,
    gdp_per_capita_usd,
    primary_health_care_expenditure_pct_che,
    govt_and_compulsory_financing,
    government_schemes,
    compulsory_contributory_health_insurance,
    social_health_insurance_schemes,
    compulsory_private_insurance_schemes,
    unspecified_compulsory_contributory_insurance,
    compulsory_medical_savings_accounts,
    unspecified_govt_compulsory_schemes,
    voluntary_health_payment_schemes,
    voluntary_health_insurance_schemes,
    non_profit_institutions_serving_households,
    enterprise_financing_schemes,
    unspecified_voluntary_payment_schemes,
    household_out_of_pocket_payments,
    rest_of_world_financing_schemes
  )

head(OOP_subset)
## # A tibble: 6 × 23
##   country region  year income       total_expenditure current_health_expenditu…¹
##   <chr>   <chr>  <dbl> <chr>                    <dbl>                      <dbl>
## 1 Algeria AFR     2000 Lower-middle           143870.                       3.49
## 2 Algeria AFR     2001 Lower-middle           162231.                       3.84
## 3 Algeria AFR     2002 Lower-middle           168702.                       3.73
## 4 Algeria AFR     2003 Lower-middle           189137.                       3.60
## 5 Algeria AFR     2004 Lower-middle           217929.                       3.54
## 6 Algeria AFR     2005 Lower-middle           244643.                       3.24
## # ℹ abbreviated name: ¹​current_health_expenditure_pct_gdp
## # ℹ 17 more variables: gdp_per_capita_usd <dbl>,
## #   primary_health_care_expenditure_pct_che <dbl>,
## #   govt_and_compulsory_financing <dbl>, government_schemes <dbl>,
## #   compulsory_contributory_health_insurance <dbl>,
## #   social_health_insurance_schemes <dbl>,
## #   compulsory_private_insurance_schemes <dbl>, …

As it currently stands, the variables we have are not very informative. We will first transform them into percentages. We consider the total sending to be the sum of out of pocket payments, contribution from social insurance, compulsory prepayment, and voluntary prepayment. We will also do some minor pre-cleaning of the data.

head(OOP_subset)
## # A tibble: 6 × 23
##   country region  year income       total_expenditure current_health_expenditu…¹
##   <chr>   <chr>  <dbl> <chr>                    <dbl>                      <dbl>
## 1 Algeria AFR     2000 Lower-middle           143870.                       3.49
## 2 Algeria AFR     2001 Lower-middle           162231.                       3.84
## 3 Algeria AFR     2002 Lower-middle           168702.                       3.73
## 4 Algeria AFR     2003 Lower-middle           189137.                       3.60
## 5 Algeria AFR     2004 Lower-middle           217929.                       3.54
## 6 Algeria AFR     2005 Lower-middle           244643.                       3.24
## # ℹ abbreviated name: ¹​current_health_expenditure_pct_gdp
## # ℹ 17 more variables: gdp_per_capita_usd <dbl>,
## #   primary_health_care_expenditure_pct_che <dbl>,
## #   govt_and_compulsory_financing <dbl>, government_schemes <dbl>,
## #   compulsory_contributory_health_insurance <dbl>,
## #   social_health_insurance_schemes <dbl>,
## #   compulsory_private_insurance_schemes <dbl>, …
# Calculate percentages of financing schemes relative to total health expenditure
OOP_subset <- OOP_subset %>%
  mutate(
    pct_govt_intervention = (govt_and_compulsory_financing / total_expenditure) * 100,
    pct_voluntary_protection = (voluntary_health_payment_schemes / total_expenditure) * 100,
    pct_household_out_of_pocket = (household_out_of_pocket_payments / total_expenditure) * 100
  )

str(OOP_subset)
## tibble [4,244 × 26] (S3: tbl_df/tbl/data.frame)
##  $ country                                      : chr [1:4244] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ region                                       : chr [1:4244] "AFR" "AFR" "AFR" "AFR" ...
##  $ year                                         : num [1:4244] 2000 2001 2002 2003 2004 ...
##  $ income                                       : chr [1:4244] "Lower-middle" "Lower-middle" "Lower-middle" "Lower-middle" ...
##  $ total_expenditure                            : num [1:4244] 143870 162231 168702 189137 217929 ...
##  $ current_health_expenditure_pct_gdp           : num [1:4244] 3.49 3.84 3.73 3.6 3.54 ...
##  $ gdp_per_capita_usd                           : num [1:4244] 1780 1755 1795 2117 2625 ...
##  $ primary_health_care_expenditure_pct_che      : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ govt_and_compulsory_financing                : num [1:4244] 103539 123669 127002 145082 155532 ...
##  $ government_schemes                           : num [1:4244] 66056 81878 83246 95931 102717 ...
##  $ compulsory_contributory_health_insurance     : num [1:4244] 37483 41791 43756 49152 52815 ...
##  $ social_health_insurance_schemes              : num [1:4244] 37483 41791 43756 49152 52815 ...
##  $ compulsory_private_insurance_schemes         : num [1:4244] 0 0 0 0 0 0 0 0 0 0 ...
##  $ unspecified_compulsory_contributory_insurance: num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ compulsory_medical_savings_accounts          : num [1:4244] 0 0 0 0 0 0 0 0 0 0 ...
##  $ unspecified_govt_compulsory_schemes          : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ voluntary_health_payment_schemes             : num [1:4244] 3221 3409 3700 4055 5530 ...
##  $ voluntary_health_insurance_schemes           : num [1:4244] 1177 1375 1610 1855 3150 ...
##  $ non_profit_institutions_serving_households   : num [1:4244] 80 85 90.5 100 110 ...
##  $ enterprise_financing_schemes                 : num [1:4244] 1964 1949 2000 2100 2270 ...
##  $ unspecified_voluntary_payment_schemes        : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ household_out_of_pocket_payments             : num [1:4244] 37111 35153 38000 40000 56867 ...
##  $ rest_of_world_financing_schemes              : num [1:4244] NA NA NA NA NA NA NA NA NA NA ...
##  $ pct_govt_intervention                        : num [1:4244] 72 76.2 75.3 76.7 71.4 ...
##  $ pct_voluntary_protection                     : num [1:4244] 2.24 2.1 2.19 2.14 2.54 ...
##  $ pct_household_out_of_pocket                  : num [1:4244] 25.8 21.7 22.5 21.1 26.1 ...

We want to ensure that there are no missing values in our primary variables. It appears from below that our data in fairly clean.

# Remove rows with missing values in key variables
OOP_subset <- OOP_subset %>%
  filter(
    !is.na(pct_govt_intervention) &
    !is.na(pct_voluntary_protection) &
    !is.na(pct_household_out_of_pocket)
  )

str(OOP_subset)
## tibble [4,115 × 26] (S3: tbl_df/tbl/data.frame)
##  $ country                                      : chr [1:4115] "Algeria" "Algeria" "Algeria" "Algeria" ...
##  $ region                                       : chr [1:4115] "AFR" "AFR" "AFR" "AFR" ...
##  $ year                                         : num [1:4115] 2000 2001 2002 2003 2004 ...
##  $ income                                       : chr [1:4115] "Lower-middle" "Lower-middle" "Lower-middle" "Lower-middle" ...
##  $ total_expenditure                            : num [1:4115] 143870 162231 168702 189137 217929 ...
##  $ current_health_expenditure_pct_gdp           : num [1:4115] 3.49 3.84 3.73 3.6 3.54 ...
##  $ gdp_per_capita_usd                           : num [1:4115] 1780 1755 1795 2117 2625 ...
##  $ primary_health_care_expenditure_pct_che      : num [1:4115] NA NA NA NA NA NA NA NA NA NA ...
##  $ govt_and_compulsory_financing                : num [1:4115] 103539 123669 127002 145082 155532 ...
##  $ government_schemes                           : num [1:4115] 66056 81878 83246 95931 102717 ...
##  $ compulsory_contributory_health_insurance     : num [1:4115] 37483 41791 43756 49152 52815 ...
##  $ social_health_insurance_schemes              : num [1:4115] 37483 41791 43756 49152 52815 ...
##  $ compulsory_private_insurance_schemes         : num [1:4115] 0 0 0 0 0 0 0 0 0 0 ...
##  $ unspecified_compulsory_contributory_insurance: num [1:4115] NA NA NA NA NA NA NA NA NA NA ...
##  $ compulsory_medical_savings_accounts          : num [1:4115] 0 0 0 0 0 0 0 0 0 0 ...
##  $ unspecified_govt_compulsory_schemes          : num [1:4115] NA NA NA NA NA NA NA NA NA NA ...
##  $ voluntary_health_payment_schemes             : num [1:4115] 3221 3409 3700 4055 5530 ...
##  $ voluntary_health_insurance_schemes           : num [1:4115] 1177 1375 1610 1855 3150 ...
##  $ non_profit_institutions_serving_households   : num [1:4115] 80 85 90.5 100 110 ...
##  $ enterprise_financing_schemes                 : num [1:4115] 1964 1949 2000 2100 2270 ...
##  $ unspecified_voluntary_payment_schemes        : num [1:4115] NA NA NA NA NA NA NA NA NA NA ...
##  $ household_out_of_pocket_payments             : num [1:4115] 37111 35153 38000 40000 56867 ...
##  $ rest_of_world_financing_schemes              : num [1:4115] NA NA NA NA NA NA NA NA NA NA ...
##  $ pct_govt_intervention                        : num [1:4115] 72 76.2 75.3 76.7 71.4 ...
##  $ pct_voluntary_protection                     : num [1:4115] 2.24 2.1 2.19 2.14 2.54 ...
##  $ pct_household_out_of_pocket                  : num [1:4115] 25.8 21.7 22.5 21.1 26.1 ...

We will begin by exploring the basic distribution of our data.

ggplot(OOP_subset, aes(x = pct_govt_intervention)) +
  geom_histogram(binwidth = 5, fill = 'blue', color = 'black') +
  labs(
    title = 'Distribution of Government Based Schemes',
    x = 'Percentage (%)',
    y = 'Frequency'
  )

ggplot(OOP_subset, aes(x = pct_household_out_of_pocket)) +
  geom_histogram(binwidth = 5, fill = 'red', color = 'black') +
  labs(
    title = 'Distribution of Household Out-of-Pocket Payments',
    x = 'Percentage (%)',
    y = 'Frequency'
  )

ggplot(OOP_subset, aes(x = pct_voluntary_protection)) +
  geom_histogram(binwidth = 5, fill = 'green', color = 'black') +
  labs(
    title = 'Distribution of Voluntary Payments',
    x = 'Percentage (%)',
    y = 'Frequency'
  )

Now is a good time to explain what exactly each one of these variables really mean.

Firstly, the distribution of government-based schemes shows how much of total healthcare spending is funded through the government. A high percentage here indicates that healthcare programs are more tax-funded as opposed to individually funded.

Secondly, the distribution of out of pocket payments represents the total health care expenditure that is paid directly at the point of service. This means that premiums for insurance or the tax dollars that are paid are not included in this category.

Finally, the distribution of voluntary payments shows the relative amount of health expenditures coming from sources that people opt into. The easiest example is supplemental health insurance where people can choose what plan they want.

Now we will look into some descriptive statistics by year.

# Calculate basic statistical metrics for the new variables grouped by year
yearly_stats <- OOP_subset %>%
  group_by(year) %>%
  summarise(
    mean_govt_intervention = mean(pct_govt_intervention, na.rm = TRUE),
    median_govt_intervention = median(pct_govt_intervention, na.rm = TRUE),
    sd_govt_intervention = sd(pct_govt_intervention, na.rm = TRUE),
    mean_out_of_pocket = mean(pct_household_out_of_pocket, na.rm = TRUE),
    median_out_of_pocket = median(pct_household_out_of_pocket, na.rm = TRUE),
    sd_out_of_pocket = sd(pct_household_out_of_pocket, na.rm = TRUE),
    mean_voluntary_protection = mean(pct_voluntary_protection, na.rm = TRUE),
    median_voluntary_protection = median(pct_voluntary_protection, na.rm = TRUE),
    sd_voluntary_protection = sd(pct_voluntary_protection, na.rm = TRUE)
  )

head(yearly_stats)
## # A tibble: 6 × 10
##    year mean_govt_intervention median_govt_intervention sd_govt_intervention
##   <dbl>                  <dbl>                    <dbl>                <dbl>
## 1  2000                   52.8                     51.1                 22.0
## 2  2001                   52.6                     51.4                 21.6
## 3  2002                   52.4                     52.0                 21.9
## 4  2003                   52.7                     52.3                 21.7
## 5  2004                   52.2                     52.7                 21.3
## 6  2005                   52.5                     52.9                 20.7
## # ℹ 6 more variables: mean_out_of_pocket <dbl>, median_out_of_pocket <dbl>,
## #   sd_out_of_pocket <dbl>, mean_voluntary_protection <dbl>,
## #   median_voluntary_protection <dbl>, sd_voluntary_protection <dbl>

Looking at the sample size for each datapoint, we can see that there is a significant drop at the final time point. This final time point is 2022 with only 20 samples. Thus, we shall remove 2022 from downstream analysis.

# Calculate the number of observations per year, removing rows with NAs
obs_count_per_year <- OOP_subset %>%
  group_by(year) %>%
  summarise(observation_count = n())

# Plot the number of observations over time
ggplot(obs_count_per_year, aes(x = year, y = observation_count)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 2) +
  labs(
    title = "Number of Observations Over Time",
    x = "Year",
    y = "Number of Observations"
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Utilizing these yearly aggregated statistics, let us observe how the following have changed over time for each one of our variables:

  1. How the median has changed over time. This will indicate relative trends for each of our variables.

  2. How the coefficient of variation has changed over time. This will indicate if countries are either conforming to similar policies or if, instead, they are becoming more disparate.

yearly_stats <- yearly_stats %>%
  filter(year != 2022)

ggplot(yearly_stats, aes(x = year)) +
  geom_line(aes(y = median_govt_intervention, color = "Govt Intervention"), size = 1) +
  geom_line(aes(y = median_out_of_pocket, color = "Out-of-Pocket"), size = 1) +
  geom_line(aes(y = median_voluntary_protection, color = "Voluntary Protection"), size = 1) +
  labs(
    title = "Median of Health Financing Variables Over Time",
    x = "Year",
    y = "Median (%)",
    color = "Variable"
  )

yearly_stats$cv_govt_intervention <- yearly_stats$sd_govt_intervention / yearly_stats$mean_govt_intervention
yearly_stats$cv_out_of_pocket <- yearly_stats$sd_out_of_pocket / yearly_stats$mean_out_of_pocket
yearly_stats$cv_voluntary_protection <- yearly_stats$sd_voluntary_protection / yearly_stats$mean_voluntary_protection

ggplot(yearly_stats, aes(x = year)) +
  geom_line(aes(y = cv_govt_intervention, color = "Govt Intervention"), size = 1) +
  geom_line(aes(y = cv_out_of_pocket, color = "Out-of-Pocket"), size = 1) +
  geom_line(aes(y = cv_voluntary_protection, color = "Voluntary Protection"), size = 1) +
  labs(
    title = "Coefficient of Variation of Health Financing Variables Over Time",
    x = "Year",
    y = "Coefficient of Variation",
    color = "Variable"
  )

For the final portion of this analysis, we will look at the relationship between percent spending on government intervention and out of pocket costs. We will also investigate how voluntary protection can protect consumers.

# Scatter plot of Social Health Insurance vs. Out-of-Pocket Payments
ggplot(OOP_subset, aes(x = pct_govt_intervention, y = pct_household_out_of_pocket, color = income)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'blue') +
  labs(
    title = 'Out-of-Pocket Payments vs. Government Intervention',
    x = 'Government Intervention (% of Total Health Expenditure)',
    y = 'Out-of-Pocket Payments (% of Total Health Expenditure)'
  )
## `geom_smooth()` using formula = 'y ~ x'

# Scatter plot of Social Health Insurance vs. Out-of-Pocket Payments
ggplot(OOP_subset, aes(x = pct_govt_intervention, y = pct_voluntary_protection, color = income)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'blue') +
  labs(
    title = 'Voluntary Protection Contribution vs. Government Intervention Contribution',
    x = 'Government Intervention (% of Total Health Expenditure)',
    y = 'Voluntary Protection Contribution (% of Total Health Expenditure)'
  )
## `geom_smooth()` using formula = 'y ~ x'

The analysis above has opened up potentially interesting question for further investigation. Trivally, we have shown that as the magnitude of government spending on healthcare decreases, the amount of out of pocket spending on the behalf of the individual increases. However, we notice that there are some poorer countries that are able to have less government support, but still have out of pocket costs that are comparable to those of higher income countries. Investigating why this is the case may prove fruitful. Furthermore, it seems that some low income countries are able to supplement low government intervention with voluntary protection plans. However, there are others that are unable to. Identifying what characteristics distinguish these two types of countries may also prove to be an interesting further line of inquiry.

Summary

Vaccine and specific healthcare spending data is hard to come across in this dataset. There are many missing values which makes it difficult to make meaningful comparisons. However, it is interesting to note that we are able to observe how different regions have different predominant diseases.

We then looked into how different country types allocate funds towards healthcare based on region and income level. There were stark differences and an abundance of data present. We were also able to observe outliers which means that there are likely other factors that influence how money is allocated towards healthcare.

Finally, we looked specifically towards how insurance is utilized to offset out of pocket costs in countries with low government healthcare spending. Some low income countries are able to brace healthcare costs utilizing private insurance. However, others are unable to do so. Exploring why this is the case may also prove to be a fruitful line of inquiry.

Citations:

World Health Organization (WHO). (2023). Global Health Expenditure Database. World Health Organization. Retrieved from https://apps.who.int/nha/database.